ABSTRACT

Within the past few years, organizations in diverse indus- 
tries have adopted MapReduce-based systems for large-scale 
data processing. Along with these new users, important new 
workloads have emerged which feature many small, short, 
and increasingly interactive jobs in addition to the large, 
long-running batch jobs for which MapReduce was origi- 
nally designed. As interactive, large-scale query processing 
is a strength of the RDBMS community, it is important that 
lessons from that field be carried over and applied where 
possible in this new domain. However, these new workloads 
have not yet been described in the literature. We fill this 
gap with an empirical analysis of MapReduce traces from six 
separate business-critical deployments inside Facebook and 
at Cloudera customers in e-commerce, telecommunications, 
media, and retail. Our key contribution is a characteriza- 
tion of new MapReduce workloads which are driven in part 
by interactive analysis, and which make heavy use of query- 
like programming frameworks on top of MapReduce. These 
workloads display diverse behaviors which invalidate prior 
assumptions about MapReduce such as uniform data ac- 
cess, regular diurnal patterns, and prevalence of large jobs. 
A secondary contribution is a first step towards creating a 
TPC-like data processing benchmark for MapReduce. 

1. INTRODUCTION

Many organizations depend on MapReduce to handle their
large-scale data processing needs. As companies across diverse
industries adopt MapReduce alongside parallel databases [5],
new MapReduce workloads have emerged that feature
many small, short, and increasingly interactive jobs.
These workloads depart from the original MapReduce use
case targeting purely batch computations, and share semantic
similarities with large-scale interactive query processing,
an area of expertise of the RDBMS community.
Consequently, recent studies on query-like programming
extensions for MapReduce [14,27,49] and applying query
optimization techniques to MapReduce [16,23,26,31,34,43] are
likely to bring considerable benefit. However, integrating
these ideas into business-critical systems requires configu- 
ration tuning and performance benchmarking against real- 
life production MapReduce workloads. Knowledge of such 
workloads is currently limited to a handful of technology 
companies [8,11,17,38,41,48]. A cross-workload comparison 
is thus far absent, and use cases beyond the technology in- 
dustry have not been described. The increasing diversity of 
MapReduce operators creates a pressing need to characterize
industrial MapReduce workloads across multiple companies 
and industries. 

Arguably, each commercial company is rightly advocating
for its particular use cases, or the particular problems
that its products address. Therefore, it falls to neutral
researchers in academia to facilitate cross-company collaboration,
and mediate the release of cross-industry data.

In this paper, we present an empirical analysis of seven 
industrial MapReduce workload traces over long durations.
They come from production clusters at Facebook, an early 
adopter of the Hadoop implementation of MapReduce, and 
at e-commerce, telecommunications, media, and retail cus- 
tomers of Cloudera, a leading enterprise Hadoop vendor. 
Cumulatively, these traces comprise over a year's worth of 
data, covering over two million jobs that moved approxi- 
mately 1.6 exabytes spread over 5000 machines (Table 1). 
Combined, the traces offer an opportunity to survey emerg- 
ing Hadoop use cases across several industries (Cloudera 
customers), and track the growth over time of a leading 
Hadoop deployment (Facebook). We believe this paper is 
the first study that looks at MapReduce use cases beyond 
the technology industry, and the first comparison of multiple 
large-scale industrial MapReduce workloads. 

Our methodology extends [17-19], and breaks down each 
MapReduce workload into three conceptual components: da- 
ta, temporal, and compute patterns. The key findings of our 
analysis are as follows: 

• There is a new class of MapReduce workloads for interac- 
tive, semi-streaming analysis that notably differs from the 
original use case targeting purely batch computations. 

• There is a wide range of behavior within this workload 
class, such that we must exercise caution in regarding any 
aspect of workload dynamics as "typical".

• Query-like programmatic frameworks on top of MapReduce
such as Hive and Pig make up a considerable fraction of 
activity in all workloads we analyzed. 

• Some prior assumptions about MapReduce such as uni- 
form data access, regular diurnal patterns, and prevalence 
of large jobs no longer hold. 
Subsets of these observations have emerged in several studies
that each look at only one MapReduce workload [11,14,
18,19,27,49]. Identifying these characteristics across a rich
and diverse set of workloads shows that the observations are 
applicable to a range of use cases. 

We view this class of MapReduce workloads for interac- 
tive, semi-streaming analysis as a natural extension of in- 
teractive query processing. Their prominence arises from 
the ubiquitous ability to generate, collect, and archive data 
about both technology and physical systems [24], as well 
as the growing statistical literacy across many industries to 
interactively explore these datasets and derive timely in- 
sights [5,14,33,39]. The semantic proximity of this MapRe- 
duce workload to interactive query processing suggests that 
optimization techniques for one likely translate to the other, 
at least in principle. However, the diversity of behavior even 
within this same MapReduce workload class complicates ef- 
forts to develop generally applicable improvements. Conse- 
quently, ongoing MapReduce studies that draw on database 
management insights would benefit from checking workload 
assumptions against empirical measurements. 

The broad spectrum of workloads analyzed allows us to 
identify the challenges associated with constructing a TPC- 
style big data processing benchmark for MapReduce. Top 
concerns include the complexity of generating representative 
data and processing characteristics, the lack of understand- 
ing about how to scale down a production workload, the 
difficulty of modeling workload characteristics that do not 
fit well-known statistical distributions, and the need to cover 
a diverse range of workload behavior. 

The rest of the paper is organized as follows. We re- 
view prior work on workload-related studies (§ 2) and de- 
velop hypotheses about MapReduce behavior using existing 
mental models. We then describe the MapReduce work- 
load traces (§ 3). The next few sections present empirical 
evidence that describes properties of MapReduce workloads
for interactive, semi-streaming analysis, which depart from 
prior assumptions about MapReduce as a mostly batch pro- 
cessing paradigm. We discuss data access patterns (§ 4), 
workload arrival patterns (§ 5), and compute patterns (§ 6). 
We detail the challenges these workloads create for building 
a TPC-style benchmark for MapReduce (§ 7), and close the 
paper by summarizing the findings, reflecting on the broader 
implications of our study, and highlighting future work (§ 8). 

2. PRIOR WORK 

The desire for thorough system measurement predates the 
rise of MapReduce. Workload characterization studies have 
been invaluable in helping designers identify problems, ana- 
lyze causes, and evaluate solutions. 

Workload characterization for database systems culmi- 
nated in the TPC-* series of benchmarks [51], which built 
on industrial consensus on representative behavior for trans- 
actional processing workloads. Industry experience also re- 
vealed specific properties of such workloads, such as Zipf 
distribution of data accesses [28], and bimodal distribution 
of query sizes [35]. Later in the paper, we see that some of 
these properties also apply to the MapReduce workloads we 
analyzed. 

The lack of comparable insights for MapReduce has hindered
the development of a TPC-like MapReduce benchmark
suite that has a similar level of industrial consensus
and representativeness. As a stopgap alternative, some
MapReduce microbenchmarks aim to facilitate performance
comparison for a small number of large-scale, stand-alone
jobs [4,6,45], an approach adopted by a series of studies
[23,31,34,36]. These microbenchmarks of stand-alone
jobs remain different from the perspective of TPC-* benchmarks,
which views a workload as a complex superposition
of many jobs of various types and sizes [50].

The workload perspective for MapReduce is slowly emerg- 
ing, albeit in point studies that focus on technology industry 
use cases one at a time [8,11,12,38,41,48]. The stand- 
alone nature of these studies forms a part of an interest- 
ing historical trend for workload-based studies in general. 
Studies in the late 1980s and early 1990s capture system 
behavior for only one setting [37,44], possibly due to the
nascent nature of measurement tools at the time. Stud- 
ies in the 1990s and early 2000s achieve greater general- 
ity [13,15,25,40,42,46], likely due to a combination of im- 
proved measurement tools, wide adoption of certain systems, 
and better appreciation of what good system measurement 
enables. Stand-alone studies have become common again in 
recent years [8,11,17,38,41,48], likely the result of only a 
few organizations being able to afford large-scale systems. 

The above considerations create the pressing need to generalize
beyond the initial point studies for MapReduce workloads.
As MapReduce use cases diversify and (mis)engineering
opportunities proliferate, system designers need to optimize
for common behavior, in addition to improving the
particulars of individual use cases.

Some studies amplified their breadth by working with 
ISPs [25,46] or enterprise storage vendors [13], i.e., interme- 
diaries who interact with a large number of end customers. 
The emergence of enterprise MapReduce vendors presents
us with similar opportunities to look beyond single-point 
MapReduce workloads. 

2.1 Hypotheses on Workload Behavior 

One can develop hypotheses about workload behavior bas- 
ed on prior work. Below are some key questions to ask about 
any MapReduce workload. 

1. For optimizing the underlying storage system: 

— How uniform or skewed are the data accesses? 

— How much temporal locality exists? 

2. For workload-level provisioning and load shaping:

— How regular or unpredictable is the cluster load? 

— How large are the bursts in the workload? 

3. For job-level scheduling and execution planning: 

— What are the common job types? 

— What are the size, shape, and duration of these jobs? 

— How frequently does each job type appear? 

4. For optimizing query-like programming frameworks: 

— What % of cluster load comes from these frame- 
works? 

— What are the common uses of each framework? 

5. For performance comparison between systems: 

— How much variation exists between workloads? 

— Can we distill features of a representative workload? 
Using the original MapReduce use case of data indexing in
support of web search [22] and the workload assumptions behind
common microbenchmarks of stand-alone, large-scale
jobs [4,6,45], one would expect answers to the above to be:
(1) Some data access skew and temporal locality exists,
but there is no information to speculate on how much.
(2) The load is sculpted to fill a predictable web search diurnal
pattern with batch computations; bursts are not a concern since
new load would be admitted conditioned on spare cluster
capacity. (3) The workload is dominated by large-scale
jobs with fixed computation patterns that are repeatedly
and regularly run. (4) We lack information to speculate
how and how much query-like programming frameworks are
used. (5) We expect small variation between different use
cases, and the representative features are already captured
in publications on the web indexing use case and existing
microbenchmarks.

Several recent studies offered single use case counter-points 
to the above mental model [11,14,18,19,27,49]. The data
in this paper allow us to look across use cases from several 
industries to identify an alternate workload class. What sur- 
prised us the most is (1). the tremendous diversity within 
this workload class, which precludes an easy characterization 
of representative behavior, and (2). that some aspects of 
workload behavior are polar opposites of the original large- 
scale data indexing use case, which warrants efforts to revisit 
some MapReduce design assumptions. 

3. WORKLOAD TRACES OVERVIEW 

We analyze seven workloads from various Hadoop deploy- 
ments. All seven come from clusters that support business- 
critical processes. Five are workloads from Cloudera's en- 
terprise customers in e-commerce, telecommunications, me- 
dia, and retail. Two others are Facebook workloads on the 
same cluster across two different time periods. These work- 
loads offer an opportunity to survey Hadoop use cases across 
several technology and traditional industries (Cloudera cus- 
tomers), and track the growth of a leading Hadoop deploy- 
ment (Facebook). 

Table 1 provides details about these workloads. The trace 
lengths are limited by the logistical feasibility of shipping 
the trace data for offsite analysis. The Cloudera customer 
workloads have raw logs approaching 100GB, requiring us 
to set up specialized file transfer tools. Transferring raw 
logs is infeasible for the Facebook workloads, requiring us 
to query Facebook's internal monitoring tools. Combined, 
the workloads contain over a year's worth of trace data, 
covering a significant amount of jobs and bytes processed 
by the clusters. 

The data comes from standard logging tools in Hadoop; 
no additional tools were necessary. The workload traces 
contain per-job summaries for job ID (numerical key), job 
name (string), input/shuffle/output data sizes (bytes), du- 
ration, submit time, map/reduce task time (slot-seconds), 
map/reduce task counts, and input/output file paths (string). 
We call each of the numerical characteristics a dimension of
a job. Some traces have some data dimensions unavailable. 

We obtained the Cloudera traces by doing a time-range 
selection of per-job Hadoop history logs based on the file 
timestamp. The Facebook traces come from a similar query 
on Facebook's internal log database. The traces reflect no 
logging interruptions, except for the cluster in CC-d, which 
was taken offline several times due to operational reasons. 
Trace     Machines   Length       Date   Jobs      Bytes moved
[...]                1 month      2011
CC-d                              2011
CC-e      100        9 days       2011   10790     590 TB
FB-2009   600        6 months     2009   1129193   9.4 PB
FB-2010   3000       1.5 months   2010   1169184   1.5 EB
Total     >5000      ≈ 1 year            2372213   1.6 EB


Table 1: Summary of traces. CC is short for "Cloudera
Customer". FB is short for "Facebook". Bytes moved
is computed by sum of input, shuffle, and output data
sizes for all jobs.
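
The "bytes moved" definition in the Table 1 caption can be sketched as follows. This is a minimal illustration, not the tooling used for the study: the `JobRecord` class and its field names are hypothetical, and treating an unavailable data dimension as zero bytes is an assumption made only for this example.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class JobRecord:
    """Hypothetical per-job summary; field names are illustrative only."""
    job_id: int
    input_bytes: Optional[int]    # None when the trace lacks this dimension
    shuffle_bytes: Optional[int]
    output_bytes: Optional[int]


def bytes_moved(jobs: Iterable[JobRecord]) -> int:
    """Sum input + shuffle + output sizes over all jobs.

    Missing sizes are counted as 0 (an assumption of this sketch).
    """
    total = 0
    for job in jobs:
        for size in (job.input_bytes, job.shuffle_bytes, job.output_bytes):
            total += size or 0
    return total
```

Counting missing dimensions as zero makes the total a lower bound for traces in which some data sizes are unavailable.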
[Figure 1 panels: per-job input size, per-job shuffle size, and
per-job output size, with axis ticks at 1, KB, MB, GB, and TB;
labeled series include FB-2009 and FB-2010.]

Figure 1: Data size for each workload. Showing input,
shuffle, and output size per job.



There are some inaccuracies at trace start and termina- 
tion, due to partial information for jobs straddling the trace 
boundaries. The length of our traces far exceeds the typical 
job length on these systems, leading to negligible errors. To 
capture weekly behavior for CC-b and CC-e, we intentionally 
queried for 9 days of data to allow for inaccuracies at trace 
boundaries. 

4. DATA ACCESS PATTERNS 

Data manipulation is a key function of any data manage- 
ment system, so understanding data access patterns is cru- 
cial. Query size, data skew, and access temporal locality are 
key concerns that impact performance for RDBMS systems. 
The mirror considerations exist for MapReduce. Specifi- 
cally, this section answers the following questions: 

— How uniform or skewed are the data accesses?

— How much temporal locality exists? 

We begin by looking at per-job data sizes, the equivalent
of query size (§ 4.1), skew in access frequencies (§ 4.2), and 
temporal locality in data accesses (§ 4.3). 

4.1 Per-job Data Sizes 

Figure 1 shows the distribution of per-job input, shuffle,
and output data sizes for each workload. Across the workloads,
the median per-job input, shuffle, and output sizes
differ by 6, 8, and 4 orders of magnitude, respectively. Most
jobs have input, shuffle, and output sizes in the MB to GB
range. Thus, benchmarks of TB and above [4,6,45] capture
only a narrow set of input, shuffle, and output patterns.

[Figure 2 axis label: output file rank by descending access frequency.]

Figure 2: Log-log file access frequency vs. rank. Showing
Zipf distribution of same shape (slope) for all workloads.

Figure 3: Access patterns vs. input file size. Showing
cumulative fraction of jobs with input files of a certain
size (top) and cumulative fraction of all stored bytes
from input files of a certain size (bottom).

From 2009 to 2010, the Facebook workloads' per-job input 
and shuffle size distributions shift right (become larger) by 
several orders of magnitude, while the per-job output size 
distribution shifts left (becomes smaller). Raw and inter- 
mediate data sets have grown while the final computation 
results have become smaller. One possible explanation is 
that Facebook's customer base (raw data) has grown, while 
the final metrics (output) to drive business decisions have 
remained the same. 

4.2 Skews in Access Frequency 

This section analyzes HDFS file access frequency and in- 
tervals based on hashed file path names. The FB-2009 and 
CC-a traces do not contain path names, and the FB-2010 
trace contains path names for input only. 

Figure 2 shows the distribution of HDFS file access frequency,
sorted by rank according to non-increasing frequency.
Note that the distributions are graphed on log-log axes, and 
form approximately straight lines. This indicates that the 
file accesses follow a Zipf-like distribution, i.e., a few files 
account for a very high number of accesses. This obser- 
vation challenges the design assumption in HDFS that all 
data sets should be treated equally, i.e., stored on the same 
medium, with the same data replication policies. Highly 
skewed data access frequencies suggest a tiered storage ar- 
chitecture should be explored [12], and any data caching 
policy that includes the frequently accessed files will bring 
considerable benefit. Further, the slope parameters of the 
distributions are all approximately 5/6 across workloads and 
for both inputs and outputs. Thus, file access patterns are 
Zipf-like distributions of the same shape. Figure 2 suggests 
the existence of common computation needs that lead to the 
same file access behavior across different industries. 
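
As an illustration of how such a slope could be estimated, the sketch below fits a straight line to log(frequency) versus log(rank) by ordinary least squares. The text does not state which fitting method was used, so the least-squares fit and the function name `zipf_slope` are assumptions of this example. On a descending-rank plot the fitted slope is negative; the value of about 5/6 quoted above would correspond to its magnitude.

```python
import math
from collections import Counter
from typing import Iterable


def zipf_slope(accessed_paths: Iterable[str]) -> float:
    """Least-squares slope of log(access frequency) versus log(rank).

    accessed_paths holds one (hashed) file path per access.
    Rank 1 is the most frequently accessed file.
    """
    counts = sorted(Counter(accessed_paths).values(), reverse=True)
    if len(counts) < 2:
        raise ValueError("need at least two distinct files")
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(count) for count in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx
```

A least-squares fit on log-log axes is easy to compute but weights the long tail of rarely accessed files heavily, so it should be read as a rough shape estimate rather than a rigorous parameter fit.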

The above observations indicate only that caching helps.
If there is no correlation between file sizes and access frequencies,
maintaining cache hit rates would require caching
a fixed fraction of bytes stored. This design is not sustainable,
since caches intentionally trade capacity for performance,
and cache capacity grows slower than full data capacity.
Fortunately, further analysis suggests more viable
caching policies.

Figure 4: Access patterns vs. output file size. Showing
cumulative fraction of jobs with output files of a certain
size (top) and cumulative fraction of all stored bytes
from output files of a certain size (bottom).


Figure 5: Data re-access intervals. Showing interval
between when an input file is re-read (top), and when an
output is re-used as the input for another job (bottom).

[Figure 6 legend: jobs whose input re-accesses pre-existing output;
jobs whose input re-accesses pre-existing input. Workloads shown:
FB-2010, CC-b, CC-c, CC-d, CC-e.]

Figure 6: Fraction of jobs that read pre-existing input
path. Note that output path information is missing from
FB-2010.

Figures 3 and 4 show data access patterns plotted against
input and output file sizes. The distributions for fraction of
jobs versus file size vary widely (top graphs), but converge
in the upper right corner. In particular, 90% of jobs access
files of less than a few GBs (note the log-scale axis).
These files account for up to only 16% of bytes stored (bottom
graphs). Thus, a viable cache policy is to cache files
whose size is less than a threshold. This policy allows cache
capacity growth rates to be detached from the growth rate
in data.
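
The size-threshold policy can be evaluated on a trace with a few lines of code. The following is a minimal sketch under stated assumptions: the function name, the parallel-list inputs, and the choice to count one access per job-file pair are all hypothetical and are not taken from the traces' actual format.

```python
from typing import Sequence, Tuple


def evaluate_size_threshold(
    file_sizes: Sequence[int],
    accesses: Sequence[int],
    threshold_bytes: int,
) -> Tuple[float, float]:
    """Evaluate a policy that caches every file smaller than threshold_bytes.

    file_sizes[i] is the size of file i in bytes and accesses[i] is the
    number of times it was accessed. Returns a pair:
    (fraction of accesses served from cache, fraction of stored bytes cached).
    """
    if len(file_sizes) != len(accesses):
        raise ValueError("file_sizes and accesses must have the same length")
    total_bytes = sum(file_sizes)
    total_accesses = sum(accesses)
    if total_bytes == 0 or total_accesses == 0:
        raise ValueError("trace must contain bytes and accesses")
    cached = [i for i, size in enumerate(file_sizes) if size < threshold_bytes]
    hit_fraction = sum(accesses[i] for i in cached) / total_accesses
    byte_fraction = sum(file_sizes[i] for i in cached) / total_bytes
    return hit_fraction, byte_fraction
```

Sweeping `threshold_bytes` over a range of sizes yields the trade-off between hit fraction and cached bytes that the top and bottom graphs of Figures 3 and 4 describe.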

Prior work has also observed Zipf-like distributed data
access patterns for RDBMS workloads, culminating in the
formulation of the 80-20 rule, i.e., 80% of the data accesses
go to 20% of the data [28]. For MapReduce, the rule is
more complicated. We need to consider both the input and
output data sets, and the size of each data set. If we had
just considered a Zipf log-log slope of 5/6, we would have
arrived at an 80-40 rule. Figures 3 and 4 account for the size of
data sets also, and indicate that 80% of jobs (data accesses)
go to less than 10% of the stored bytes, for both input and
output data sets. Depending on the workload, the access
patterns range from an 80-1 rule to an 80-8 rule.

4.3 Access Temporal Locality 

Further analysis also reveals temporal locality in the data 
accesses. Figure 5 indicates the distribution of time intervals 
between data re-accesses. 75% of the re-accesses take place 
within 6 hours. Thus, a possible cache eviction policy is to 
evict entire files that have not been accessed for longer than 
a workload-specific threshold duration. Any policy similar
to least-recently-used (LRU) would make sense.
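
The re-access intervals behind this observation can be computed from a list of timestamped accesses. The sketch below is illustrative only: the `(timestamp, path)` event format and both function names are assumptions, and unlike Figure 5 it does not distinguish input re-reads from outputs re-used as inputs.

```python
from typing import Dict, Iterable, List, Tuple


def reaccess_intervals(events: Iterable[Tuple[float, str]]) -> List[float]:
    """Seconds between consecutive accesses to the same path.

    events are (timestamp_seconds, hashed_path) pairs in any order.
    """
    last_seen: Dict[str, float] = {}
    intervals: List[float] = []
    for timestamp, path in sorted(events):
        if path in last_seen:
            intervals.append(timestamp - last_seen[path])
        last_seen[path] = timestamp
    return intervals


def fraction_within(intervals: List[float], window_seconds: float) -> float:
    """Fraction of re-accesses that occur within window_seconds."""
    if not intervals:
        return 0.0
    within = sum(1 for gap in intervals if gap <= window_seconds)
    return within / len(intervals)
```

Calling `fraction_within(intervals, 6 * 3600)` gives the share of re-accesses that a six-hour eviction threshold would still find in cache.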

Figure 6 further shows that up to 78% of jobs involve
data re-accesses (CC-c, CC-d, CC-e), while for other workloads,
the fraction is lower. Thus, the same cache eviction
policy potentially translates to different benefits for different
workloads.

Combined, the observations in this section indicate that it 
will be non-trivial to preserve for performance comparisons 
the data size, skew in access frequency, and access temporal 
locality of the data. The analysis also reveals the tremen- 
dous diversity across workloads. Only one numerical feature 
remains relatively fixed across workloads — the shape param- 
eter of the Zipf-like distribution for data access frequencies. 
Consequently, we should be cautious in considering any
aspect of workload behavior as being "typical".

5. WORKLOAD VARIATION OVER TIME 

The temporal workload intensity variation has been an im- 
portant concern for RDBMS systems, especially ones that 
back consumer-facing systems subject to unexpected spikes 
in behavior. The transactions or queries per second metric 
quantifies the maximum stress that the system can handle. 
The analogous metric for MapReduce is more complicated, 
as each job or "query" in MapReduce potentially involves 
different amounts of data, and different amounts of computation
on the data. Actual system occupancy depends on
the combination of these multiple time-varying dimensions,
with as yet unknown correlation between the dimensions.

The empirical workload behavior over time has implica- 
tions for provisioning and capacity planning, as well as the 
ability to do load shaping or consolidate different workloads. 
Specifically, this section tries to answer the following: 

— How regular or unpredictable is the cluster load? 

— How large are the bursts in the workload? 

In the following, we look at workload variation over a
week (§ 5.1), quantify burstiness, a common feature for all
workloads (§ 5.2), and compute temporal correlations between
different workload dimensions (§ 5.3). Our analysis
proceeds in four dimensions: the job submission counts,
the aggregate input, shuffle, and output data size involved,
the aggregate map and reduce task times, and the resulting
system occupancy in the number of active task slots.

5.1 Weekly Time Series 

Figure 7 depicts the time series of four dimensions of 
workload behavior over a week. The first three columns 
respectively represent the cumulative job counts, amount
of I/O (again counted from MapReduce API), and compu- 
tation time of the jobs submitted in that hour. The last col- 
umn shows cluster utilization, which reflects how the cluster 
serviced the submitted workload described by the preceding 
columns, and depends on the cluster hardware and execu- 
tion environment. 
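
The first three columns can be derived from per-job summaries by bucketing jobs on their submit time. The sketch below shows one way to do this; the tuple layout and the function name are hypothetical, and the fourth column (cluster utilization) is not covered because it depends on how the cluster executed the jobs rather than on when they were submitted.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical per-job tuple: (submit_time_seconds, bytes_moved, task_seconds)
Job = Tuple[float, int, float]


def hourly_series(jobs: Iterable[Job]) -> Dict[int, List[float]]:
    """Map hour index to [jobs submitted, bytes moved, task-seconds].

    Every quantity is attributed to the hour in which the job was submitted.
    """
    series: Dict[int, List[float]] = defaultdict(lambda: [0, 0, 0.0])
    for submit_time, moved, task_seconds in jobs:
        bucket = series[int(submit_time // 3600)]
        bucket[0] += 1
        bucket[1] += moved
        bucket[2] += task_seconds
    return dict(series)
```

Hours with no submissions are absent from the returned dictionary and would need to be filled with zeros before plotting or further time series analysis.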

Figure 7 shows all workloads contain a high amount of 
noise in all dimensions. As neither the signal nor the noise 
models are known, it is challenging to apply standard signal 
processing methods to quantify the signal to noise ratio of 
these time series. Further, even though the number of jobs 
submitted is known, it is challenging to predict how much 
I/O and computation will result. 

Some workloads exhibit diurnal patterns, which are revealed
by Fourier analysis and, in some cases, are visually identifiable
(e.g., job submissions for FB-2010, utilization for CC-e).
In Section 7, we combine this observation with several oth- 
ers to speculate that there is an emerging class of interactive 
and semi-streaming workloads. 
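
One simple way to look for a diurnal pattern with Fourier analysis is to compute the power spectrum of an hourly series and check whether the strongest component has a 24-hour period. The text does not describe how its Fourier analysis was carried out, so the sketch below is only an assumed illustration using a direct discrete Fourier transform; both function names are hypothetical.

```python
import cmath
import math
from typing import List, Sequence


def power_spectrum(series: Sequence[float]) -> List[float]:
    """Squared DFT magnitude at frequencies 1..n//2 of a mean-removed series."""
    n = len(series)
    if n < 2:
        raise ValueError("need at least two samples")
    mean = sum(series) / n
    centered = [value - mean for value in series]
    spectrum = []
    for k in range(1, n // 2 + 1):
        coeff = sum(
            centered[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)
        )
        spectrum.append(abs(coeff) ** 2)
    return spectrum


def dominant_period_hours(hourly: Sequence[float]) -> float:
    """Period, in hours, of the strongest frequency component."""
    spectrum = power_spectrum(hourly)
    strongest = max(range(len(spectrum)), key=spectrum.__getitem__) + 1
    return len(hourly) / strongest
```

For a noisy trace, the height of the 24-hour peak relative to the rest of the spectrum matters more than the location of the single largest component.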
Figure 7: Workload behavior over a week. From left to right: (1) Jobs submitted per hour. (2) Aggregate I/O
(i.e., input + shuffle + output) size of jobs submitted. (3) Aggregate map and reduce task time in task-hours of jobs
submitted. (4) Cluster utilization in average active slots. From top row to bottom, showing CC-a, CC-b, CC-c, CC-d, CC-e,
FB-2009, and FB-2010 workloads. Note that for CC-c, CC-d, and FB-2009, the utilization data is not available from the
traces. Also note that some time axes are misaligned due to short, week-long trace lengths (CC-b and CC-e), or gaps
from missing data in the trace (CC-d).



Figure 7 offers visual evidence to indicate the diversity 
of MapReduce workloads. There is significant variation in 
the shape of the graphs for both different dimensions of the 
same workloads (rows) and for the same workload dimen- 
sion across different workloads (columns). Consequently, for 
cluster management problems that involve workload varia- 
tion over time scales, such as load scheduling, load shifting, 
resource allocation, or capacity planning, approaches de- 
signed for one workload may be suboptimal or even counter- 
productive for another. As MapReduce use cases diversify 
and increase in scale, it becomes vital to develop workload 
management techniques that can target each specific work- 
load. 

5.2 Burstiness 

Figure 7 also reveals bursty submission patterns across
various dimensions. Burstiness is an often discussed property
of time-varying signals, but it is often not precisely measured.
One common way to attempt to measure it is to use the
peak-to-average ratio. There are also domain-specific metrics,
such as for bursty packet loss on wireless links [47].
Here, we extend the concept of peak-to-average ratio to
quantify burstiness.

Figure 8: Workload burstiness. Showing cumulative
distribution of task-time (sum of map time and reduce
time) per hour. To allow comparison between workloads,
all values have been normalized by the median task-time
per hour for each workload. For comparison, we also
show burstiness for artificial sine submit patterns, scaled
with min-max range the same as mean (sine + 2) and
10% of mean (sine + 20).

Figure 9: Correlation between different submission pattern
time series. Showing pair-wise correlation between
jobs per hour, (input + shuffle + output) bytes per hour,
and (map + reduce) task times per hour.

We start defining burstiness first by using the median
rather than the arithmetic mean as the measure of "average".
Median is statistically robust against data outliers,
i.e., extreme but rare bursts [30]. For two given workloads
with the same median load, the one with higher peaks, that
is, a higher peak-to-median ratio, is more bursty. We then
observe that the peak-to-median ratio is the same as the
100th-percentile-to-median ratio. While the median is
statistically robust to outliers, the 100th percentile is not. This
implies that the 99th, 95th, or 90th percentile should also
be calculated. We extend this line of thought and compute
the general nth-percentile-to-median ratio for a workload.
We can graph this vector of values, with (nth percentile)/(median) on
the x-axis, versus n on the y-axis. The resultant graph can
be interpreted as a cumulative distribution of arrival rates
per time unit, normalized by the median arrival rate. This
graph is an indication of how bursty the time series is. A
more horizontal line corresponds to a more bursty workload;
a vertical line represents a workload with a constant arrival
rate.
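
The nth-percentile-to-median curve can be computed directly from an hourly load series. The sketch below is a minimal illustration: the function names are hypothetical, and the use of linear interpolation between samples for the percentile is an assumption, since the text does not specify a percentile definition.

```python
import statistics
from typing import List, Sequence, Tuple


def percentile(sorted_values: Sequence[float], n: float) -> float:
    """n-th percentile (0 <= n <= 100) of sorted data, linearly interpolated."""
    if not sorted_values:
        raise ValueError("empty series")
    position = (len(sorted_values) - 1) * n / 100.0
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    weight = position - lower
    return sorted_values[lower] * (1 - weight) + sorted_values[upper] * weight


def burstiness_curve(hourly_load: Sequence[float]) -> List[Tuple[float, int]]:
    """Points (nth percentile / median, n) for n = 0, 1, ..., 100."""
    ordered = sorted(hourly_load)
    median = statistics.median(ordered)
    if median == 0:
        raise ValueError("median load is zero; the ratio is undefined")
    return [(percentile(ordered, n) / median, n) for n in range(101)]
```

The last point of the curve, `burstiness_curve(load)[100][0]`, is the peak-to-median ratio; a constant load yields a ratio of 1 at every percentile, which plots as a vertical line.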

Figure 8 graphs this metric for one of the dimensions 
of our workloads. We also graph two different sinusoidal 
signals to illustrate how common signals appear under this 
burstiness metric. Figure 8 shows that for all workloads, the 
highest and lowest submission rates are orders of magnitude 
from the median rate. This indicates a level of burstiness 
far above the workloads examined by prior work, which have 
more regular diurnal patterns [38,48]. For the workloads
here, scheduling and task placement policies will be essen- 
tial under high load. Conversely, mechanisms for conserving 
energy will be beneficial during periods of low utilization. 

For the Facebook workloads, over a year, the peak-to-median
ratio dropped from 31:1 to 9:1, accompanied by more
internal organizations adopting MapReduce. This shows
that multiplexing many workloads (workloads from many
organizations) helps decrease burstiness. However, the
workload remains bursty.

5.3 Time Series Correlations 

We also computed the correlation between the workload
submission time series in all three dimensions. Specifically,
we compute three correlation values: between the
time-varying vectors jobsSubmitted(t) and dataSizeBytes(t),
between jobsSubmitted(t) and computeTimeTaskSeconds(t),
and between dataSizeBytes(t) and computeTimeTaskSeconds(t),
where t represents time in hourly granularity, and
ranges over the entire trace duration.
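
A pair-wise correlation between two hourly vectors can be computed as below. The text does not name the correlation measure, so the use of the Pearson correlation coefficient here is an assumption, as are the function and variable names.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length series."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("series must have equal length of at least 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)
```

With hourly lists `jobs_submitted`, `data_size_bytes`, and `compute_time_task_seconds` covering the same hours, the three values are `pearson(jobs_submitted, data_size_bytes)`, `pearson(jobs_submitted, compute_time_task_seconds)`, and `pearson(data_size_bytes, compute_time_task_seconds)`.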



The results are in Figure 9. The average temporal correla- 
tion between job submit and data size is 0.21; for job submit 
and compute time it is 0.14; for data size and compute time 
it is 0.62. The correlation between data size and compute 
time is by far the strongest. We can visually verify this by 
the 2nd and 3rd columns for CC-e in Figure 7. This indicates
that MapReduce workloads remain data-centric rather than 
compute-centric. Also, schedulers and load balancers need 
to consider dimensions beyond number of active jobs. 

Combined, the observations in this section mean that max- 
imum jobs per second is the wrong performance metric to 
evaluate these systems. The nature of any workload bursts 
depends on the complex aggregate of data and compute 
needs of active jobs at the time, as well as the scheduling, 
placement, and other workload management decisions that 
determine how quickly jobs drain from the system. Any
effort to develop a TPC-like benchmark for MapReduce
should consider a range of performance metrics, and stress
the system under realistic, multi-dimensional variations
in workload intensity.

6. COMPUTATION PATTERNS 

Previous sections looked at data and temporal patterns
in the workload. As computation is an equally important
aspect of MapReduce, this section identifies the common
computation patterns for each workload. Specifically,
we answer questions related to optimizing query-like
programming frameworks:

— What % of cluster load comes from these frameworks?

— What are the common uses of each framework? 

We also answer questions with regard to job-level scheduling 
and execution planning: 

— What are the common job types? 

— What are the size, shape, and duration of these jobs? 

— How frequently does each job type appear? 

In a traditional RDBMS, one can quantify query types by
the operator (e.g., join, select) and the cardinality of the
data processed for a particular query. Each operator can
be characterized as consuming a certain amount of resources
based on the cardinality of the data it processes. The
analogs to operators for MapReduce jobs are the map and reduce
steps, and the cardinality of the data is quantified in our
analysis by the number of bytes of data for the map input,
intermediate shuffle, and reduce output stages.

We consider two complementary ways of grouping MapReduce
jobs: (1) By the job name strings submitted to MapReduce,
which gives us insights on the use of native MapReduce
versus query-like programmatic frameworks on top of
MapReduce. For some frameworks, this analysis also reveals
the frequency of the particular query-like operators that are
used (§ 6.1). (2) By the multi-dimensional job description
according to per-job data sizes, duration, and task times,
which serve as a proxy for proprietary code, and indicate the
size, shape, and duration of each job type (§ 6.2).

6.1 By Job Names 

Job names are user-supplied strings recorded by MapReduce.
Some computation frameworks built on top of MapReduce,
such as Hive [1], Pig [3], and Oozie [2], generate the job
names automatically. MapReduce does not currently impose
any structure on job names. To simplify analysis, we focus
on the first word of job names, ignoring any capitalization, 
numbers, or other symbols. 
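
The normalization can be sketched as below. The exact rule is not spelled out in the text; this sketch assumes that "first word" means the first run of ASCII letters after lowercasing, and the function names and sample job names are hypothetical.

```python
import re
from collections import Counter


def first_word(job_name):
    """Lowercase the name and return its first run of letters, or '' if none."""
    match = re.search(r"[a-z]+", job_name.lower())
    return match.group(0) if match else ""


def weigh_by_first_word(jobs):
    """Sum a per-job weight (1 per job, bytes of I/O, or task-seconds) by first word.

    jobs: iterable of (job_name, weight) pairs.
    """
    totals = Counter()
    for name, weight in jobs:
        totals[first_word(name)] += weight
    return totals


# Hypothetical job names, weighted by job count.
sample = [("INSERT OVERWRITE TABLE t", 1), ("insert_2011_03", 1), ("[7] Ad-hoc rollup", 1)]
print(weigh_by_first_word(sample).most_common())
```

Passing a different weight per job gives the same grouping weighted by I/O or by task-time.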

Figure 10 shows the most frequent first words in job names 
for each workload, weighted by number of jobs, the amount 
of I/O, and task-time. The FB-2010 trace does not have this 
information. The top panel shows that the top handful of
words account for a dominant majority of jobs. When these
names are weighted by I/O, Hive queries such as insert
and other data-centric jobs such as data extractors
dominate; when weighted by task-time, the pattern is similar,
which is unsurprising given the correlation between I/O and
task-time.

Figure 10 also implies that each workload consists of only 
a small number of common computation types. The rea- 
son is that job names are either automatically generated, or 
assigned by human operators using informal but common 
conventions. Thus, jobs with names that begin with the 
same word likely perform similar computation. The small 
number of computation types represents targets for static or
even manual optimization. This will greatly simplify work- 
load management problems, such as predicting job duration 
or resource use, and optimizing scheduling, placement, or 
task granularity. 

Each workload services only a small number of MapRe- 
duce frameworks: Hive, Pig, Oozie, or similar layers on top 
of MapReduce. Figure 10 shows that for all workloads, two 
frameworks account for a dominant majority of jobs. There 
is ongoing research to achieve well-behaved multiplexing be- 
tween different frameworks [32] . The data here suggests that 
multiplexing between two or three frameworks already cov- 
ers the majority of jobs in all workloads here. We believe 
this observation will remain valid in the future. As new 
frameworks develop, enterprise MapReduce users are likely 
to converge on an evolving but small set of mature frame- 
works for business critical computations. 

Figure 10 also shows that for Hive in particular, select 
and insert form a large fraction of activity for several work- 
loads. Only the FB-2009 workload contains a large fraction 
of Hive queries beginning with from. Unfortunately, this 
information is not available for Pig. Also, we see evidence 
of some direct migration of established RDBMS use cases, 
such as etl (Extract, Transform, Load) and edw (Enterprise 
Data Warehouse). 

This information gives us some idea with regard to good
targets for query optimization. However, more direct
information on query text at the Hive and Pig level will be even
more beneficial. For workflow management frameworks such
as Oozie, it will be beneficial to have UUIDs to identify jobs
belonging to the same workflow. For native MapReduce
jobs, it will be desirable for the job names to follow a
uniform convention of prefixes and postfixes such as dates,
computation types, steps in multi-stage processing, etc.
Obtaining information at that level will help translate insights
from multi-operator RDBMS query execution planning to
optimize multi-job MapReduce workflows.

Figure 10: The first word of job names for each workload,
weighted by the number of jobs beginning with each
word (top), total I/O in bytes (middle), and map/reduce
task-time (bottom). For example, 44% of jobs in the
FB-2009 workload have a name beginning with "ad", and a
further 12% begin with "insert"; 27% of all I/O and
34% of total task-time come from jobs with names that
begin with "from" (middle and bottom). The FB-2010
trace did not contain job names.

6.2 By Multi-Dimensional Job Behavior 

Another way to group jobs is by their multi-dimensional 
behavior. Each job can be represented as a six-dimensional 
vector described by input size, shuffle size, output size, job 
duration, map task time, and reduce task time. One way to 
group similarly behaving jobs is to find clusters of vectors 
close to each other in the six-dimensional space. We use a 
standard data clustering algorithm, k-means [9]. K-means
enables quick analysis of a large number of data points and
facilitates intuitive labeling and interpretation of cluster
centers [17, 18, 41].

We use a standard technique to choose k, the number 
of job type clusters for each workload: increment k until 
there is diminishing return in the decrease of intra-cluster 
variance, i.e., residual variance. Our previous work [17,18] 
contains additional details of this methodology. 
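
The "increment k until diminishing returns" rule can be sketched as follows. This is an illustration, not the procedure from [17, 18]: the farthest-point seeding, the `min_gain` threshold, and the two-dimensional toy points (standing in for the six-dimensional job vectors) are all assumptions.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def init_centers(points, k):
    """Deterministic farthest-point seeding (an illustrative choice)."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        )
    return centers


def kmeans(points, k, max_iters=100):
    """Lloyd's algorithm. Returns (centers, residual sum of squared distances)."""
    centers = init_centers(points, k)
    for _ in range(max_iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            groups[nearest].append(p)
        new_centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    residual = sum(min(sq_dist(p, c) for c in centers) for p in points)
    return centers, residual


def choose_k(points, max_k, min_gain=0.1):
    """Increase k until the residual drops by less than min_gain of the k=1 residual."""
    _, base = kmeans(points, 1)
    prev = base
    for k in range(2, max_k + 1):
        _, res = kmeans(points, k)
        if base == 0 or (prev - res) / base < min_gain:
            return k - 1
        prev = res
    return max_k


# Hypothetical toy data: three well-separated groups of three points each.
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10),
              (20, 0), (20, 1), (21, 0)]
print(choose_k(toy_points, max_k=6))
```

On real job vectors whose dimensions span many orders of magnitude, some rescaling before clustering would likely be needed; that step is omitted here.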

Table 2 summarizes our k-means analysis results. We 
have assigned labels using common terminology to describe 
the one or two data dimensions that separate job categories 
within a workload. A system optimizer would use the full 
numerical descriptions of cluster centroids. 

We see that jobs touching <10GB of total data make up 
>92% of all jobs. These jobs are capable of achieving in- 
teractive latency for analysts, i.e., durations of less than a 
minute. The dominance of these jobs counters prior assump- 
tions that MapReduce workloads consist of only jobs at TB 
scale and beyond. The observations validate research efforts 
to improve the scheduling time and the interactive capability 
of large-scale computation frameworks [14,33,39]. 

The dichotomy between very small and very large jobs has
been identified previously for workload management of busi- 
ness intelligence queries [35] . Drawing on the lessons learned 
there, poor management of a single large job potentially im- 
pacts performance for a large number of small jobs. 

The small-big job dichotomy implies that the cluster should 
be split into two tiers. There should be (1) a performance 
tier, which handles the interactive and semi-streaming com- 
putations and likely benefits from optimizations for interac- 
tive RDBMS systems, and (2) a capacity tier, which nec- 
essarily trades performance for efficiency in using storage 
and computational capacity. The capacity tier likely as- 
sumes batch-like semantics. One can view such a setup as 
analogous to multiplexing OLTP (interactive transactional) 
and OLAP (potentially batch analytical) workloads. It is 
important to operate both parts of the cluster while simul- 
taneously achieving performance and efficiency goals. 

The dominance of small jobs complicates efforts to rein in
stragglers [10], tasks that execute significantly slower than
other tasks in a job and delay job completion. Comparing
the job duration and task time columns in Table 2 indicates that
small jobs contain only a handful of small tasks, sometimes
a single map task and a single reduce task. Having few
comparable tasks makes it difficult to detect stragglers, and
also blurs the definition of a straggler. If the only task of
a job runs slowly, it becomes impossible to tell whether the
task is inherently slow or abnormally slow. The importance
of stragglers as a problem also requires re-assessment. Any
straggler will seriously hamper a job that has a single wave
of tasks. However, if stragglers occur randomly
with a fixed probability, fewer tasks per job means
only a few jobs would be affected. We do not yet know
whether stragglers occur randomly.
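
To make the last argument concrete, the following sketch works out the fraction of jobs affected under the hypothetical case that stragglers are random. It assumes each task is a straggler independently with the same probability p, which the paper does not establish; the numbers are illustrative only.

```latex
% Assumption: each of the n tasks in a job is a straggler
% independently with probability p.
P(\text{job has at least one straggler}) = 1 - (1 - p)^{n}
% Illustrative values for p = 0.05:
%   n = 2:    1 - 0.95^{2}   = 0.0975
%   n = 100:  1 - 0.95^{100} \approx 0.994
```

Under this assumption, a job with two tasks is affected about one time in ten, while a job with a hundred tasks is almost always affected.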

Interestingly, map functions in some jobs aggregate data, 
reduce functions in other jobs expand data, and many jobs 
contain data transformations in either stage. Such data ra- 
tios reverse the original intuition behind map functions as 
expansions, i.e., "maps", and reduction functions as aggre- 
gates, i.e., "reduces" [22]. 
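
One way to label these data ratios per job is sketched below. The tolerance band, the labels, and the function names are assumptions made for illustration; the paper's own labels come from the cluster centers in Table 2.

```python
def classify_stage(bytes_in, bytes_out, tolerance=0.1):
    """Label a stage by its output/input byte ratio.

    tolerance is a hypothetical band around 1.0 treated as a pure transform.
    """
    if bytes_in <= 0:
        raise ValueError("stage input must be positive")
    ratio = bytes_out / bytes_in
    if ratio < 1 - tolerance:
        return "aggregate"
    if ratio > 1 + tolerance:
        return "expand"
    return "transform"


def classify_job(input_bytes, shuffle_bytes, output_bytes):
    """Return (map_label, reduce_label); reduce_label is None for map-only jobs."""
    if shuffle_bytes == 0:
        # Map-only job: the map stage writes the job output directly.
        return classify_stage(input_bytes, output_bytes), None
    return (
        classify_stage(input_bytes, shuffle_bytes),
        classify_stage(shuffle_bytes, output_bytes),
    )


# Hypothetical job: the map stage shrinks the data, the reduce stage grows it.
print(classify_job(input_bytes=1000, shuffle_bytes=10, output_bytes=500))
```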

Also, map-only jobs appear in all but two workloads. 
They form 7% to 77% of all bytes, and 4% to 42% of all 
task times in their respective workloads. Some are Oozie 
launcher jobs and others are maintenance jobs that oper- 
ate on very little data. Compared with other jobs, map- 
only jobs benefit less from datacenter networks optimized 
for shuffle patterns [7,8,20,29]. 

Further, FB-2010 and CC-c both contain jobs that take
considerably longer to complete than other jobs in the same
workload with comparable data sizes. FB-2010 contains a job type
that consumes only tens of GB of data, but requires days to
complete (Map only transform, 3 days). These jobs have
inherently low levels of parallelism, and cannot take advan- 
tage of parallelism on the cluster, even if spare capacity is 
available. 

Comparing the FB-2009 and FB-2010 workloads in Table 2 
shows that job types at Facebook changed significantly over 
one year. The small jobs remain, and several kinds of map- 
only jobs remain. However, the job profiles changed in sev- 
eral dimensions. Thus, for Facebook, any policy parameters 
need to be periodically revisited. 

Combined, the analysis once again reveals the diversity 
across workloads. Even though small jobs dominate all seven 
workloads, they are "small" in different ways for each
workload. Further, the breadth of job shapes, sizes, and durations
across workloads indicates that microbenchmarks of a
handful of jobs capture only a small sliver of workload activity,
and a truly representative benchmark will need to involve a 
much larger range of job types. 

7. TOWARDS A BIG DATA BENCHMARK 

In light of the broad spectrum of industrial data presented 
in this paper, it is natural to ask what implications we 
can draw with regard to building a TPC-style benchmark 
for MapReduce and similar big data systems. The work- 
loads here are sufficient to characterize an emerging class 
of MapReduce workloads for interactive and semi-streaming 
analysis. However, the diversity of behavior across the work- 
loads we analyzed means we should be careful when deciding 
which aspects of this behavior are representative enough to 
include in a benchmark. Below, we discuss some challenges
associated with building a TPC-style benchmark for
MapReduce and other big data systems.

Table 2: Job types in each workload as identified by k-means clustering, with cluster sizes, centers, and labels. Map
and reduce time are in task-seconds, i.e., a job with 2 map tasks of 10 seconds each has map time of 20 task-seconds.
Note that the small jobs dominate all workloads.

Data generation.

The range of data set sizes, skew in access frequency, and
temporal locality in data access all affect system
performance. A good benchmark should stress the system with
realistic conditions in all these areas. Consequently, a
benchmark needs to pre-generate data that accurately reflects the
complex data access patterns of real life workloads.
Processing generation. 

The analysis in this paper reveals challenges in accurately 
generating a processing stream that reflects real life work- 
loads. Such a processing stream needs to capture the size, 
shape, and sequence of jobs, as well as the aggregate clus- 
ter load variation over time. It is non-trivial to tease out 
the dependencies between various features of the processing 
stream, and even harder to understand which ones we can 
omit for a large range of performance comparison scenarios. 
Mixing MapReduce and query-like frameworks. 

The heavy use of query-like frameworks on top of MapReduce
indicates that future cluster management systems need
to efficiently multiplex both jobs written in the native
MapReduce API and jobs from query-like frameworks such as Hive, Pig,
and HBase. Thus, a representative benchmark also needs to
include both types of processing, and multiplex them in
realistic mixes.

Scaled-down workloads. 

The sheer data size involved in the workloads means that
it is economically challenging to reproduce workload
behavior at production scale. One can scale down workloads
in proportion to cluster size. However, there are many ways to
describe both cluster and workload size. One could normalize
workload size parameters such as data size, number of
jobs, or the processing per unit of data, against cluster size
parameters such as number of nodes, CPU capacity, or available
memory. It is not yet clear what would be the best way to
scale down a workload.
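
As one illustration of the choices involved, the sketch below scales per-job data sizes by the ratio of node counts. This is only one of the normalizations listed above, not a recommended answer; the record field names are hypothetical.

```python
SIZE_FIELDS = ("input_bytes", "shuffle_bytes", "output_bytes")


def scale_down(jobs, source_nodes, target_nodes):
    """Scale each job's byte counts by target_nodes / source_nodes.

    jobs: list of dicts carrying the (hypothetical) fields in SIZE_FIELDS.
    Integer arithmetic avoids floating-point rounding; results round down.
    """
    if source_nodes <= 0 or target_nodes <= 0:
        raise ValueError("node counts must be positive")
    scaled = []
    for job in jobs:
        copy = dict(job)
        for field in SIZE_FIELDS:
            copy[field] = job[field] * target_nodes // source_nodes
        scaled.append(copy)
    return scaled


# Hypothetical job from a 3000-node trace replayed on a 30-node test cluster.
trace = [{"name": "j1", "input_bytes": 3000, "shuffle_bytes": 600, "output_bytes": 30}]
print(scale_down(trace, source_nodes=3000, target_nodes=30))
```

Note how the smallest quantity rounds down to zero: scaling data size alone can erase exactly the small jobs that dominate these workloads, which is part of why the choice of normalization is not obvious.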
Empirical models. 

The workload behaviors we observed do not fit any well-known
statistical distributions (the single exception being
the Zipf distribution in data access frequency). It is necessary
for a benchmark to assume an empirical model of workloads,
i.e., the workload traces are the model. This is a departure
from some existing TPC-* benchmarking approaches, where
the targeted workloads are such that some simple models can
be used to generate data and the processing stream [51].
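
One simple way to act on "the traces are the model" is to draw whole job records from the trace instead of fitting a parametric distribution. The sketch below assumes this resampling approach; it is not a description of any existing benchmark tool, and the names are hypothetical.

```python
import random


def sample_workload(trace_jobs, n, seed=None):
    """Draw n job records with replacement from the trace.

    Sampling whole records keeps each job's input, shuffle, and output sizes
    together, so relationships between dimensions are preserved.
    """
    if not trace_jobs:
        raise ValueError("trace must contain at least one job")
    rng = random.Random(seed)
    return [rng.choice(trace_jobs) for _ in range(n)]


# Hypothetical three-job trace.
tiny_trace = [
    {"input_bytes": 10_000, "shuffle_bytes": 0, "output_bytes": 9_000},
    {"input_bytes": 5_000_000, "shuffle_bytes": 40_000, "output_bytes": 800},
    {"input_bytes": 200, "shuffle_bytes": 200, "output_bytes": 200},
]
synthetic = sample_workload(tiny_trace, n=5, seed=42)
```

Sampling each dimension independently would be simpler still, but would discard the per-job relationships that the clustering analysis relies on.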
A true workload perspective. 

The data in the paper indicate the shortcomings of mi- 
crobenchmarks that execute a small number of jobs one at 
a time. They are useful for diagnosing subcomponents of a 
system subject to very specific processing needs. A big data 
benchmark should assume the perspective already reflected 
in TPC-* [50], and treat a workload as a steady process- 
ing stream involving the superposition of many processing 
types. 

Workload suites. 

The workloads we analyzed exhibit a wide range of behavior.
If this diversity is preserved across more workloads,
we would be compelled to accept that no single set of
behaviors is representative. In that case, we would need to
identify a small suite of workload classes that cover a large
range of behavior. The benchmark would then consist not
of a single workload, but of a workload suite. Systems could
trade optimized performance for one workload type against
more average performance for another.
A stopgap tool. 

We have developed and deployed the Statistical Workload
Injector for MapReduce (https://github.com/SWIMProjectUCB/SWIM/wiki).
This is a set of New BSD Licensed workload
replay tools that partially address the above challenges.
The tools can pre-populate HDFS using uniform synthetic 
data, scaled to the number of nodes in the cluster, and replay 
the workload using synthetic MapReduce jobs. The work- 
load replay methodology is further discussed in [18]. The 
SWIM repository already includes scaled-down versions of 
the FB-2009 and FB-2010 workloads. Cloudera has allowed 
us to contact the end customers directly and seek permission 
to make public their traces. We hope the replay tools can 
act as a stop-gap while we progress towards a more thorough 
benchmark, and the workload repository can contribute to 
a scientific approach to designing big data systems such as 
MapReduce. 

8. SUMMARY AND CONCLUSIONS 

To summarize the analysis results, we directly answer the 
questions raised in Section 2.1. The observed behavior spans 
a wide range across workloads, as we detail below. 

1. For optimizing the underlying storage system: 

— Skew in data access frequencies ranges between an
80-1 and an 80-8 rule.

— Temporal locality exists, and 80% of data re-accesses 
occur on the range of minutes to hours. 

2. For workload-level provisioning and load shaping: 

— The cluster load is bursty and unpredictable. 

— Peak-to-median ratio in cluster load ranges from 9:1
to 260:1.

3. For job-level scheduling and execution planning: 

— All workloads contain a range of job types, with the 
most common being small jobs. 

— These jobs are small in all dimensions compared 
with other jobs in the same workload. They involve 
10s of KB to GB of data, exhibit a range of data 
patterns between the map and reduce stages, and 
have durations of 10s of seconds to a few minutes. 



— The small jobs form over 90% of all jobs for all work- 
loads. The other job types appear with a wide range 
of frequencies. 

4. For optimizing query-like programming frameworks: 

— The cluster load that comes from these frameworks
is at least 20% and up to 80%.

— The frameworks are generally used for interactive 
data exploration and semi-streaming analysis. For 
Hive, the most commonly used operators are insert 
and select; from is frequently used in only one 
workload. Additional tracing at the Hive/Pig/HBase 
level is required. 

5. For performance comparison between systems: 

— A wide variation in behavior exists between work- 
loads, as the above data indicates. 

— There is sufficient diversity between workloads that
we should be cautious in claiming any behavior as
"typical". Additional workload studies are required.

The analysis in this paper has several repercussions: (1)
MapReduce has evolved to the point where performance
claims should be qualified with the underlying workload
assumptions, e.g., by replaying a suite of workloads. (2)
System engineers should regularly re-assess design priorities
subject to evolving use cases. Prerequisites to these efforts 
are workload replay tools and a public workload repository, 
so that engineers can share insights across different enter- 
prise MapReduce deployments. 

Future work should seek to improve analysis and mon- 
itoring tools. Enterprise MapReduce monitoring tools [21] 
should perform workload analysis automatically, present gra- 
phical results in a dashboard, and ship only the anonymized 
and aggregated metrics for workload comparisons offsite. 
Most importantly, tracing capabilities at the Hive, Pig, and 
HBase level should be improved. An analysis of query text 
at that level will reveal further insights, and expedite trans- 
lating RDBMS knowledge to optimize MapReduce and solve 
real life problems involving large-scale data. 

Improved tools will facilitate the analysis of more work- 
loads, over longer time periods, and for additional statistics. 
This improves the quality and generality of the derived de- 
sign insights, and contributes to the overall efforts to identify 
common behavior. The data in this paper indicate that we 
need to look at a broader range of use cases before we can 
build a truly representative big data benchmark. 

We invite cluster operators and the broader data manage- 
ment community to share additional knowledge about their 
MapReduce workloads. To contribute, retain the job history 
logs generated by existing Hadoop tools, run the tools at 
https://github.com/SWIMProjectUCB/SWIM/wiki/Analyze-historical-cluster-traces-and-synthesize-representative-workload,
and share the results.